Learning Output Kernels for Multi-Task 
Problems 



Francesco Dinuzzo 

O 



X 



Abstract 



Simultaneously solving multiple related learning tasks is beneficial un- 
der a variety of circumstances, but the prior knowledge necessary to cor- 
rectly model task relationships is rarely available in practice. In this 
paper, we develop a novel kernel-based multi-task learning technique that 

O automatically reveals structural inter-task relationships. Building over 
the framework of output kernel learning (OKL), we introduce a method 
I— I that jointly learns multiple functions and a low-rank multi-task kernel 

CZ2 by solving a non-convex regularization problem. Optimization is carried 

out via a block coordinate descent strategy, where each subproblem is 
solved using suitable conjugate gradient (CG) type iterative methods for 
linear operator equations. The effectiveness of the proposed approach is 
^ demonstrated on pharmacological and collaborative filtering data. 

^ 1 Introduction 

Combining multiple datasets for solving related inference problems is a common 
C*"") and successful practice in several domains such as econometrics and marketing 

analysis, recommender systems on the web, bioinformatics, pharmacology, and 
many others. Indeed, simultaneously solving multiple inference problems can 
improve performances, provided that the relationships between them are cor- 
rectly modeled. The themes of transfer learning, multi-task learning or "learn- 
ing to learn" [9, 19, 44] have attracted considerable attention in the literature, 
see [30] for a recent survey. A possible way to enforce relationships between 
the tasks is to extract a set of shared features. This can be done by em- 
ploying multi-layer neural networks where the hidden layer is shared among 
the tasks [9], or also within a convex regularization framework [4]. Task re- 
lationships can be also modeled within the framework of Gaussian Processes 
estimation [7,8,31,40,46,49] by designing a joint covariance function. A va- 
riety of other models have been proposed to represent and exploit inter-task 
relationships [3, 23, 32, 48, 50, 51] . 
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Correctly modeling the inter-task relationships for a given application is critical 
but not always easy. In fact, in many inference problems, the tasks are known to 
be related with each other, but the available prior knowledge is not sufficient to 
model the relationships in advance. For this reason, some recent works have been 
focusing on inferring inter-task relationships automatically from the data, while 
solving the multi-task learning problem. One possibility is to assume that the 
tasks can be clustered into homogeneous groups and try to learn such clustering 
from the data [6]. Optimization-based approaches that jointly infer the task 
parameters and the inter-task relationships in the form of a similarity matrix 
include the spectral regularization approach of [5] and the method of [52] based 
on convex optimization. An online method has been also proposed recently to 
learn multiple linear classifiers as well as a task relationship matrix [38] . 

Regularized kernel methods [39] have been employed successfully in a variety of 
single-task learning problems. Their extension to the multi-task setting [15] calls 
for the design of suitable operator- valued kernels that model similarities of both 
inputs and tasks. Some classes of operator-valued kernels have been recently 
reviewed in [2]. Within the framework of regularization in RKHS of vector- 
valued functions, the problem of inferring task relationships boils down to the 
problem of learning a multi-task kernel. A possible way to address this problem 
consists in learning a linear combination of task-specific kernels [47], within 
the framework of multiple kernel learning (MKL). Recently, a class of output 
kernel learning (OKL) methods [12, 13] has been introduced in the context of 
multi-output learning problems (such as vectorial regression, multi-class and 
multi-label classification) to automatically synthesize a decomposable matrix- 
valued kernel that encodes the relationships between the output components. 
Differently from MKL, OKL methods don't require the specification of a set of 
basis kernels to be combined, as the output kernel is searched into the whole 
cone of positive semidefinite kernels. 

This paper explores the possibility of automatically learning the relationships 
between multiple tasks via output kernel learning. To this end, we extend a 
technique for learning low-rank output kernels proposed in [11] to the multi-task 
setting. The low-rank encouraging regularization can be shown to be useful both 
for computational reasons and learning performances. The extension developed 
in this paper is based on the use of a suitable weighted loss that allows for 
multiple datasets with different input sampling patterns. The resulting method 
can be also interpreted as a non-linear kernel based generalization of low-rank 
matrix completion techniques. Even in the simpler setup of linear low-rank ma- 
trix factorization, the case of incomplete observations has been recognized to be 
computationally challenging, see e.g. [42], since existing optimization techniques 
based on eigendecompositions do not carry over to the weighted case. For this 
reason, we have developed a new strategy to solve the optimization problem. 
Similarly to [52] , we learn multiple tasks as well as their relationships by solving 
a joint optimization problem. However, differently from [52] and [38], that learn 
multiple linear functions by convex optimization, we learn multiple non-linear 
functions by solving a non-convex optimization problem. Albeit non-convex, the 
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problem exhibits a special structure that allows for effective optimization via 
a block-coordinate descent procedure based on the iterative application of the 
conjugate gradient (CG) algorithm for linear operator equations. Our method 
is also related to the technique derived in [8] in the context of Bayesian estima- 
tion of Gaussian Processes, that allows to learn a similarity (covariance) matrix 
between the tasks. While [8] aims at optimizing a marginal likelihood type 
functional, our method is based on the minimization of a functional with trace 
norm regularization plus rank constraint. In [8], the authors adopt a general 
purpose gradient-descent solver to optimize their objective functional, whereas 
in this paper we develop a novel optimization strategy that is specifically de- 
signed to solve the proposed OKL optimization problem. Differently from [8], 
our method is able to deal with the case of incomplete sampling, and auto- 
matically encourages low-rank solutions without introducing relaxations of the 
original problem. 

2 Weighted output kernel learning 

Let I denote the identity matrix and e the vector of all ones (of suitable di- 
mensions). For any matrix A, let A T denote the transpose, tr(A) the trace, 
rank(A) the rank, vec(A) the vectorization operator, A^ the pseudo-inverse, 

||A|| F = (tr(A T A)) 1/2 the Frobenius norm, and ||A||* := tr ((A T A) 1/2 ) the 

nuclear norm. For any pair of matrices of the same size A,B, let (A,B)_p := 
tr(A T B) denote the Frobenius inner product. The symbols <g) and denote 
the Kronecker product and the Hadamard (element- wise) product, respectively. 
Finally, let S™ denote the closed cone of positive scmidefinite matrices of order 
to, and 

§™' p = {A G S™ 1 : rank(A) < p) C 8™ 
the cone of positive semidefinite matrices with rank less than or equal to p. 

2.1 Output kernel learning for multi-task problems 

Consider the problem of learning several functions (tasks) gj : X — > M (j — 
1, . . . , to) from multiple datasets of input-output pairs (ary, 2/y). Since the input 
set X is common to all the tasks, the functions gj can be combined into a single 
vector- valued function g : X —> W n , to be learned from a single dataset of i 
pairs {xi, t/i), where the output data yi are now vectors that may contain missing 
components. Let W € {0, 1} xm denote a binary weight matrix specifying which 
output components are missing for each example. More precisely, the i-th row 
of the weight matrix is a binary vector Wi with zeros in correspondence with 
the missing components of yi. Since the weight matrix is always available, we 
can assume without loss of generality that all the missing output components 
have been imputed with zeros. 
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In the following, we describe a regularization approach where the function g is 
searched into a reproducing kernel Hilbert space (RKHS) % of vector-valued 
functions g : X — > M. m associated with a decomposable reproducing kernel H of 
the form 

H(x 1 ,x 2 ) = K x (x 1 ,x 2 ) ■ L, 

where Kx is positive semidefinite scalar kernel (called input kernel) and L G §™ 
is a symmetric positive semidefinite matrix (called output kernel). For more 
details about RKHS of vector- valued functions, see [27]. 

The function g € T-L and the output kernel L are simultaneously learned by 
solving a joint regularization problem of the form 

w l (3{g{x i )-y i )\\l \\g\\ 2 u tr(L) \ 

2A 2 2 j ' 1 J 



The objective functional contains a data-fitting term, taking into account the 
prediction error in correspondence with the observed output components, a 
regularization term on the unknown function g, and a trace regularization on 
the output kernel. Optimization of the output kernel is carried over the low-rank 
cone S™' p , thus also imposing a hard rank constraint. 

In view of the representer theorem [14,24], the minimization problem with re- 
spect to g admits a solution of the form 




g* (x) = L^dKxjx^j), 



with suitable coefficient vectors Ci € M. m . Letting K <E Sfj_ be such that ~Kij = 
K(xi,Xj), the following finite-dimensional optimization problem is obtained 



mm . 

lgs™- p V 2A 2 



Y-KCL||^ (C J KC,L) F tr(L) 



(2) 



where the matrices Y, C 6 M £xm are defined as 

Y = (j/i, . . . ,yi) T , C = (ci,...,q) t , 

and the (semi)-norm || • ||w is given by 

||A|| w = ||W0A|| F . 

Observe that Y is a sparse matrix (missing values have been imputed with 
zeros). Moreover, since W is a binary matrix with the same sparsity pattern of 
Y, we also have 

Y = W©Y. (3) 
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Within the formulation of problems (1) and (2), a first possibility is to simply 
set p = Tfi. With such a choice, the hard constraint on the rank is not present, 
but the trace regularization will still encourage low-rank solutions, see e.g. [16]. 
It is nevertheless convenient to allow for the more general choice p < m. A 
first reason is that, by providing an explicit bound p < m on the rank, it is 
possible to control the computational and memory requirements of the method 
before running the optimization. This holds since, within the optimization 
framework developed in the next section, only matrices of rank p are stored and 
manipulated, whereas the full matrix L is never formed explicitly. A second 
reason is that, when the regularization parameter A is small enough, the hard 
rank constraint becomes active, and may yield interesting solutions that are 
not obtainable by using only the trace regularization (see, for example, the 
experiment in subsection 4.1). 

The objective functional of (2) is not jointly (quasi)-convcx. However, it is 
convex separately with respect to both L and C. In addition, for p — m it 
is an invex function [28] in the interior of the feasible set, meaning that every 
stationary point is a global minimizer. When all the output components are 
observed, i.e. W = ee T , problem (2) can be attacked using techniques based on 
eigendecompositions. Since these techniques do not apply anymore for general 
weight matrices, in the following we develop a new strategy to obtain a minimizer 
of (2). 



2.2 Equivalent formulations 

Before going into the details of the optimization procedure proposed to solve 
(2), is it useful to introduce some equivalent reformulations of the optimization 
problem. The proofs of the two following Lemmas are reported in the appendix. 



Lemma 2.1 If (A, B) is an optimal solution of the following problem: 

/ ||Y-KAB^ + (A 1 KA) £ + HiY 

agR £xp V 2A 2 2 ) 

BeK mxp 

then any pair (C, L) such that 

L = BB T , A = CB, 

is an optimal solution of problem (2). 



Lemma 2.1 provides a new formulation of the low-rank output kernel learning 
problem (2) that involves only "thin" matrices A and B, whose size can be 
controlled by selecting the parameter p. Such formulation turns out to be par- 
ticularly convenient for optimization purposes. It is also insightful to observe 
that problem (2) can be reformulated as a reduced rank least square problem 
with a "kernelized" nuclear norm regularization. 
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Lemma 2.2 If® is an optimal solution of the following problem: 

' Y - K0 '^ + trf(0W /2 



mm 

eel"™ [ 2A 

subject to rank(0) < p, 

then the pair 

L=(0 T K0) 1/2 , = ^0, 

is an optimal solution of problem (2). 

Lemma 2.2 shows that the low-rank output kernel learning problem (2) is a non- 
linear kernelized generalization of least squares low-rank matrix approximation 
(RMF) with nuclear norm regularization. Indeed, RMF can be obtained as a 
particular case by choosing the input kernel as a Kronecker delta kernel 



K(xi,x 2 ) = S K {xi,x 2 ) 

so that 



1, X], = x 2 
0, else 



K = I, tr((0 T K©) 1/2, l = 110 



The related MMMF (maximum-margin matrix factorization) technique [33,43] 
would also correspond to the case K = I, but with hinge-type (SVM) losses 
instead of the square loss. 



3 A block coordinate descent strategy 

In the following, we focus on the solution of (4). If p is much smaller than to, 
handling the low-dimensional factor B is much more convenient than directly 
handling the full output kernel matrix. Let J(A, B) denote the objective func- 
tional of (4). Observe that, although J is non-convex, it is unconstrained and 
separately convex with respect to both factors. Therefore, it is natural to adopt 
a block coordinate descent technique that iteratively alternates between opti- 
mization with respect to the two blocks. Clearly, other optimization strategies 
are also possible, but the block coordinate descent approach is memory efficient, 
robust, and simple to implement. In addition, it turns out to behave well in 
practice, since very few iterations are typically sufficient to obtain a good solu- 
tion, especially when combined with a warm-start regularization path procedure 
(see subsection 3.3). As shown in the following, the two subproblems boil down 
to the solution of linear equations of the form 

£ A A. = y A , (5) 
C B B = y B (6) 

where Ca and Cb are linear operators mapping matrices into vectors. In the 
following subsections, we derive equations (5)-(6), and discuss the application 
of suitable iterative methods for solving them. 
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3.1 Sub-problem w.r.t. A 

The sub-problem w.r.t. A is the most numerically challenging of the two since, 
in general, the linear operator Ca is not symmetric. For any fixed B, a necessary 
and sufficient condition for A to be optimal is obtained by setting to zero the 
partial derivative of the objective functional: 



where we have used (3) in order to simplify the right hand side. If the weight 
matrix is full, this last equation reduces to a discrete-time Sylvester equation, 
a well studied class of linear matrix equations, see e.g. [41]. However, for gen- 
eral W, the optimality condition is not a Sylvester equation anymore, due to 
presence of the Hadamard product. Now, letting 



and using the matrix identities (8) and (10), the sufficient optimality condition 
can be rewritten in the form (5). 

If W is full, it is easy to verify, using the mixed-product identity (9) for the 
Hadamard product, that the operator Ca reduces to 



and is therefore symmetric and positive definite. In such a case, one can either 
apply the conjugate gradient algorithm, or also use a procedure based on eigen- 
decompositions in order to solve (5). Unfortunately, these methods cannot be 
applied for general weight matrices, since the operator La is not guaranteed to 
be symmetric. 

A first possible way to attack the problem is trying to directly obtain a solution 
of the non-symmetric operator equation (5) by means of iterative methods that 
can handle non-symmetric equations, such as the generalized minimal residual 
method (GMRES) [37]. However, the computational requirements of GMRES 
grow with the number of iterations and, for large scale problems, a restart 
strategy is needed to limit the memory requirements at the expense of conver- 
gence speed. The alternative is to introduce a change of variable to make the 
problem symmetric, and then apply a preconditioned conjugate gradient (CG) 
algorithm [20] on the new linear equation. Generally, CG requires less com- 
putational resources than GMRES, since the search directions are obtained via 




A sufficient condition is given by 



W (KAB T ) B + AA = YB, 



W D = diag(vec(W)), 

C A A = [(B T <g> I) W D (B ® K) + AI] vec(A), 
y B = vcc(YB), 



C A A = [(B T B) (g) K + AI] vec(A), 
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simple recursive updates and a restart strategy is not required. For additional 
details about these and others iterative method for solving linear equations, see 
e.g. [36]. 

Another way to address non-symmetry is to look for a change of variable that 
makes the problem symmetric. A first possible change of variable derives from 
the observation that the optimal solution must be in the form A = CB. By 
using this representation, one can equivalently solve a new linear equation with 
respect to C: 

W [K (W C) BB T ] + AC = Y. 

Once again, the equation can be rewritten in the form 

£ c C = vec(Y), 

where 

C C C = [W D ((BB T ) K) W D + Al] vec(C). 

This time, the operator Cq is symmetric and positive definite, so that CG 
applies. A disadvantage of this approach is that, for p <§C m, C is much higher- 
dimensional than A. Nevertheless, when the weight matrix W is sparse, C is 
also a sparse matrix with the same sparsity pattern, and one can take advantage 
of this property. 

Another approach is based on a factorization of the input kernel matrix of the 
form 

K = FF T . 

The factor F can be a Cholesky factor, a matrix square root, or also a low-rank 
factor. We can use the factorization to introduce a new change of variable 

A F = F T A, (7) 

so that the matrix Ap solves the equation 

F T W (FA F B T ) B + XAp = F T YB. 

This last equation can be rewritten in vectorized form 

C F A F = vec(F T YB), 

where 

C F A F = [(B T F T ) W D (B F) + AI] vcc(A F ). 

Independently of which factorization of K has been adopted, the linear operator 
C F is symmetric and positive definite, so that CG can be applied. Once a 
solution A F has been obtained, the matrix A can be recovered as a solution of 
the linear system (7). 

In our experiments, all of the aforementioned approaches have been tested. The 
last approach based on the Cholesky factorization of the input kernel matrix 
turned out to be generally faster than the others, therefore we selected it for 
our current implementation. 
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3.2 Sub-problem w.r.t. B 

The sub-problem w.r.t B is essentially a multiple ridge regression problem. 
First of all, observe that the objective functional depends on A only through 
the product E = KA. Therefore, when A is fixed, a necessary and sufficient 
condition for B to be optimal is 



In this case, it turns out that an expression for the rows of B can be even written 
in closed form. Indeed, let bj (j = 1, . . . ,m) denote the rows of B, yi and 
denote the columns of Y and W, respectively. Then, we have 



Observe that the updates for the different rows can be computed in parallel. 
Finally, by letting 



the optimality condition can be also rewritten in the form (6). Here, the operator 
Cb can be seen to be symmetric and positive definite, therefore it is possible to 
use a conjugate gradient method to obtain a solution. 

3.3 Implementation details 

Several important details must be taken into account in order to guarantee a 
correct and efficient convergence behavior. 

3.3.1 Warm-start path procedure 

It is typically necessary to train the kernel machine for several different values of 
the regularization parameter A. Generally, the regularization problem is better 
conditioned for large values of A. On the same time, the solution is expected 
to depend continuously on the regularization parameter. For these reasons, it 
is convenient to initialize the optimization of each problem with the solution 
obtained with the previous value of A, after sorting the different values of the 
regularization parameter in decreasing order. Such warm-start regularization 
path strategy is very effective in practice: if the grid of values of A is sufficiently 
fine, one or two iterations of block coordinate descent for each value of A are 
often sufficient to converge with a good accuracy. 





C B B = [(E T <g> I)diag(vec(W T ))(E ® I) + Al] vec(B) 



and 



y B = vec(Y T E), 
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3.3.2 Initialization 

It is possible to show that the rank of B cannot increase between an iteration of 
block-coordinate descent and the next. For this reason, at the very beginning 
of the warm-start procedure, B is initialized to a full-rank matrix. Otherwise, 
an optimal solution of high rank may be missed by the algorithm. Another 
issue comes from the fact that the origin is a stationary point of the objective 
functional for any value of A. Although, for very large values of the regulariza- 
tion parameter, the origin is actually a global minimizer, this is not the case 
anymore for small values. Now, in view of the warm-start procedure, the algo- 
rithm may never move away from the origin if the initial A is very large. To 
safeguard against this behavior, B is re-initialized to a full-rank matrix every 
time it becomes too small (for example, w.r.t. the Frobenius norm). 

4 Experiments 

In this section, we analyze the performance of weighted OKL on a variety of 
multi-task problems, including pharmacokinetic-pharmacodynamics (PK-PD) 
problems, and popular collaborative filtering benchmarks (MovieLens datasets). 
In all the experiments on real data, the performance of weighted OKL is com- 
pared with the following methods: 

• Independent single-task learning: corresponds to fixing the output 
kernel as L = I instead of optimizing it, thus learning each task indepen- 
dently of the others. 

• Pooled single-task learning: corresponds to fixing the output kernel 
as L = ee T , thus assuming that all the tasks are the same. 

• Regularized Matrix Factorization (RMF): corresponds to fixing the 
input kernel as the Kronecker delta kernel, so that K = I. 

All the experiments have been run in a MATLAB environment with an Intel i5 
CPU 2.4 GHz, 4 GB RAM. 

4.1 Reconstruction and denoising of multiple signals 

In order to analyze the dependence of the computation time on the rank bound 
parameter p of the proposed OKL method, we conducted some controlled exper- 
iments on synthetic data. The experiments show both the computational and 
the potential predictive advantage induced by the hard rank constraint. First of 
all, we generated 50 independent realizations Z k , (k = 1, . . . , 50) of a Gaussian 
Process on the interval [—1,1] of the real line with zero-mean and covariance 
function 

cov [{Z k ) Xl ,(Z k ) X2 \ = exp(-10|a;i - x 2 \). 
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Then, we generated m = 200 new processes Uj as 

50 
k=l 

where the mixing coefficients Bjk are independently drawn from a uniform dis- 
tribution on the interval [0, 1]. Output data yij have been generated by sampling 
the processes Uj in correspondence with 100 points in the interval [—1,1] ran- 
domly drawn from a uniform grid of 200 points, and corrupting them by adding 
a zero-mean Gaussian noise with a signal to noise ratio of 1:1. For each pro- 
cess Uj, only 70 out of the 100 inputs have been used for training, while the 
remaining 30 are used for tuning the regularization parameter A. 

The input kernel for OKL is the same as the covariance function of the processes 
Zk- Performances are measured by the reconstruction MSE (mean squared 
error), which is computable since we have also access to the generated non- 
corrupted signals. Figure 1 reports the MSE (left panel) and the computation 
time needed to train OKL over a whole range of values of the regularization 
parameter A (right panel), both as a function of the parameter p. Observe that 
the training time is roughly increasing with p since matrices of higher dimension 
are involved in the computation. 

The best performance (in terms of MSE) is observed in correspondence with an 
intermediate value of the rank bound parameter (the vertical line corresponds 
to p = 32). Hence, the best rank is considerably lower than the "true rank" of 
the model used to generate the non-corrupted data (namely 50). A mismatch 
between the two is to be expected for finite and noisy samples. Conditions 
under which the rank of the original model can be recovered will depend on the 



11 



sample size, the noise level, and the criteria used to choose the regularization 
parameter. It would be interesting to determine such conditions, a problem that 
we leave to future investigations. 

4.2 Pharmacological data 

Reconstructing response curves in multiple subjects from sparse sets of indi- 
vidual measurements is a typical problem in pharmacology. The curves are 
generally expected to exhibit similar shapes but, on the same time, consider- 
able inter-subject variability. 

The Study 810 dataset [18, 26] has been obtained from a multicentric clinical 
trial (Study 810) for testing the efficacy of paroxetine (an antidepressant drug). 
The dataset contains time profiles of the so-called Hamilton Depression Rat- 
ing Scale (HAMD) score for several patients suffering from major depressive 
disorders. The HAMD is an index obtained by processing a multiple choice 
questionnaire, that is used to measure the severity of a patient's major depres- 
sion. The patients under study were either treated with placebo or administered 
paroxetine at two different doses. The HAMD score was evaluated at several vis- 
its, planned for each patient at weeks 0,1, 2, 3, 4, 6, and 8. The drug is considered 
effective if a significant decreasing in the HAMD over the weeks is observed in 
the patients under treatment. Due to frequent dropouts (patients abandoning 
the study before its completion) , several of the scores are missing, especially in 
the last weeks. Patients with missing scores cannot be simply removed from the 
dataset, since dropouts are highly correlated with non-decreasing of the HAMD 
score, and their removal would bias the efficacy assessment. Reconstructing the 
missing scores is a multi-task learning problem where each patient corresponds 
to a task, inputs are the time instants, and outputs are the scores. A full-rank 
(p = m) weighted OKL method has been applied, by modeling the HAMD score 
time profiles as piece-wise linear functions. This can be done by simply using a 
linear spline input kernel: 

K(h,t 2 ) = l + mm{ti,t 2 }. 

The overall dataset contains 2855 scores for 494 patients. Following the setup 
of [13], we extracted a test set containing 1012 scores, including all the scores 
taken after the third week for a subset of 450 randomly chosen patients. The 
remaining 1843 examples are further divided into a validation set containing 
553 (about 30%) examples, and a training set containing the remaining 1290 
examples (about 70%). The random training/ validation selection is repeated 
50 times, and each time the RMSE on both the validation and the test set 
is computed. The regularization parameter is chosen so as to minimize the 
average validation error. Table 1 reports the average and standard deviation 
of the test RMSE obtained by using OKL, a nuclear norm regularized low- 
rank matrix approximation method (RMF), the pooled, and the independent 
baselines, showing a clear advantage of the OKL multi-task approach. The 
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numerical rank of the best found output kernel is 7. Training over a whole 
regularization path took about 36 seconds in the average. 

Table 1: Study 810 dataset: best average RMSE on test data (and their stan- 
dard deviation over the 50 splits) 



Pooled 


Independent 


RMF 


OKL 


6.86 (0.02) 


6.72(0.16) 


6.66(0.4) 


5.37(0.2) 



The PK-PD 27 dataset [29, 34] contains xenobiotics concentration time profiles 
for 27 human subjects, with samples taken at {0.5,1,1.5,2,4,7,12,24} hours 
after a drug administration. The goal is to reconstruct the continuous-time 
response curves over the non-negative real time semiline, a multi-task learning 
problem where the tasks are the concentration time profiles for each subject, and 
the inputs are the time instants. We apply the weighted OKL method described 
in this paper with full rank (p = m), as well as RMF, the independent, and the 
pooled baselines. The input kernel is chosen as 

K(t 1 ,t 2 )=t 1 t 2 W(h(t 1 ),h(t 2 )), 

where W is the cubic spline kernel 

jir, s XiX2m.m{x 1 ,x 2 } (min{xi, x 2 }) 3 
W{xi,x 2 ) = , 

and h is a simple transformation designed to obtain an asymptotic decay to zero 
of the response curve: 

h(t) = (1 + t) -1 . 

The kernel is designed so as to incorporate the available prior knowledge about 
the typical shape of a concentration response curve, which has to be smooth, 
with zero initial condition, and asymptotically decaying to zero. In order to 
simulate a realistic sparse sampling scenario, we follow the approach of [31], 
where only 3 measurements per subject (out of the 8 available) are randomly 
selected for training. The remaining samples are used for test. The selection 
is repeated 100 times, and the results are averaged. Table 2 reports the values 
of average RMSE in correspondence with the best value of A, again showing an 
advantage of OKL over the other methods. Figure 2 reports the average RMSE 
as a function of the regularization parameter for the different method. Figure 3 
reports the average training time of OKL as a function of the regularization pa- 
rameter. The numerical rank of the best found output kernel is 27 (therefore, in 
this case we get a full rank solution) . Training OKL over a whole regularization 
path took about 1.5 seconds in the average. 
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Figure 2: PK-PD 27 dataset: average RMSE on the test data as a function of 
the regularization parameter A. 




Figure 3: PK-PD 27 dataset: average training time of OKL as a function of the 
regularization parameter A. The total training time over all the values of A (area 
under the curve) is about 1.5 seconds. 
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Table 2: PK-PD 27 dataset: best average RMSE on test data (and their standard 
deviation over the 100 splits) 



Pooled 


Indep. 


RMF 


OKL 


15.4(0.68) 


13.4(1.55) 


14.6(1.97) 


12 (0.92) 



4.3 Collaborative filtering (MovieLens) data 

The MovieLens datasets 1 are popular collaborative filtering benchmarks, con- 
taining collections of ratings in the range {1, . . . , 5} assigned by several users 
to a set of movies. The goal is to learn the preferences of each user for all the 
movies. This can be interpreted as a multi-task learning problem, where each 
task is the preference function of one of the users. Currently, three datasets of 
different sizes are available, see Table 3. In addition to the ratings, the datasets 
contain additional metadata associated with the movies (e.g. genre and title), 
the users (e.g. gender, age, occupation) or the ratings themselves (timestamp, 
tags). 

Table 3: MovieLens datasets: total number of users, movies, and ratings. 



Dataset 


Users 


Movies 


Ratings 


MovieLenslOOK 


943 


1682 


10 5 


MovieLenslM 


6040 


3706 


10 6 


MovieLens 10M 


69878 


10677 


10 7 



The weighted OKL method described in this paper has been applied to the three 
MovieLens datasets. The input kernel uses the movie's metadata, while the 
similarity between the users is learned automatically from the data. Although 
the framework allows to incorporate any type of movie metadata, only the Id 
and the genre (present in all three datasets) have been used for simplicity. The 
input kernel has been designed as 

K( Xl ,x 2 ) = S K (x\ d ,xi d ) +exp(-d H (x{,x 9 2 )) , 

where 8k is the Kronecker delta kernel (non-zero only when the movie Ids are 
equal) , X ■) X ^ £1X6 binary vectors encoding the genre metadata, and dn is the 
normalized Hamming distance. It can be easily shown that K is a valid positive 
semidefinite kernel. In these problems, keeping the 8k kernel part is important, 
since it allows to treat two distinct movies with identical metadata as different. 

Learning performance are evaluated using both the root mean squared error 
(RMSE), and the normalized mean absolute error (NMAE). The latter is ob- 
tained by normalizing the mean absolute error (MAE) by a factor that de- 
pends on the range of the ratings, that is NMAE = MAE/(r max — r m j n ), see 

1 http : //www. grouplens . org/ 
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e.g. [17]. For the MovieLens datasets, we have r m ; n = 1 and r max = 5, so that 
NMAE = 0.25 • MAE. For MovieLenslOOK and MovieLenslOM, performance are 
evaluated on the test sets r a and r& provided with the datasets. The dataset 
MovieLens 1M does not come with predefined test sets, therefore we also ex- 
tracted a random test set containing about the 50% of the ratings for each user, 
a setup adopted in [22,45]. The regularization parameter is tuned automatically 
by minimizing over a broad range the performance measure (RMSE or NMAE) 
on a validation set containing 25% of the non-test examples for each user. Only 
the remaining 75% are used for training. The results are summarized in Table 4. 
In all the experiments, we use raw data without any normalization, and the rank 
of the output kernel has been limited to p = 5. The rank constraint turned out 
to be active for all the optimal kernels. The multi-task OKL method systemat- 
ically outperforms RMF, as well as the two single-task baselines. Other recent 
results on these datasets under various experimental settings can be found, for 
example, in [1,10,22,25,33,45]. 

Table 4: MovieLens datasets: test RMSE (above) and NMAE (below) for differ- 
ent dataset splittings for weighted OKL, RMF, the pooled and the independent 
baselines. 



Test 


Pooled 


Independent 


RMF 


OKL 


MovieLenslOOK 


r a 


1.0371 
0.2087 


1.0605 
0.2147 


1.0007 
0.1949 


0.9751 
0.1906 


n 


1.0423 
0.2099 


1.0809 
0.2181 


1.0296 
0.2019 


0.9958 
0.1942 


50% 


1.0209 
0.2043 


1.0445 
0.2109 


1.0300 
0.1993 


0.9557 
0.1893 


MovieLenslM 


50% 


0.9811 
0.1961 


1.0297 
0.2070 


0.9023 
0.1761 


0.8945 
0.1752 


MovieLenslOM 


r a 


0.9989 
0.1970 


1.0344 
0.2063 


1.6501 
0.2892 


0.9427 
0.1806 


n 


0.9810 
0.1926 


1.0211 
0.2034 


0.9296 
0.1806 


0.9141 
0.1756 


50% 


0.9441 
0.1846 


0.9721 
0.1915 


0.8627 
0.1642 


0.8501 
0.1624 
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5 Conclusions and future developments 



Learning multiple tasks and simultaneously inferring the relationships between 
them is possible by using a kernel-based method that learns a decomposable 
multi-task kernel from multiple datasets. By employing a block coordinate 
descent strategy and iterative solvers for linear operator equations, it is possible 
to efficiently obtain a minimizer that yields good predictive performances. The 
method obviates the issue of manually specifying the similarities between the 
tasks. In addition, a systematic learning performance improvement with respect 
to single-task baselines and standard regularized low-rank matrix approximation 
can be observed on several datasets. 

In the future, it would be worthwhile to extend the proposed method by using 
loss functions different from the the least square loss. Also, it would be interest- 
ing to extend the framework so as to exploit other types of structural knowledge 
about the task relationships, for instance along the lines of [4,35]. 

A Proofs 

Proof of Lemma 2.1 

Any optimal A £ M. £xp for problem (4) admits a unique decomposition of the 
form 

A = CB + U, UB T = 0. 

By letting L = BB T , it follows that L G S™' p , and we have 

KAB T = KCBB T = KCL. 

In addition, we have 

(A, KA) f = (U,KU) F (C T KC,L) F 
2 2 2 

> (C T KC,L) F 
2 

It follows that we can set U = and A = CB, without any loss of generality. 

□ 

Proof of Lemma 2.2 

Letting = CL, problem (2) can be rewritten as 

min [ H Y - K0 ilw + min jVe(L) ) , 
subject to rg(0) C rg(L), 



17 



where 

iVe(L) = i 0TK0 ' L ^ ■ tr(L) 



2 2 
A minimizer of AT@(L) can be expressed in closed-form: 

L = (0 T K0) 1/2 , 

so that 

min AWL) = tr ( (0 T K0) 1/2 

All the constraints are satisfied provided that 

rank(0) = rank(L) < p. 

a 



B Matrix identities 



We recall some identities involving the vectorization operator, the Kronecker 
and the Hadamard products, see e.g. [21]: 

vec (AXB) = (B T ® A) vec(X), (8) 

(A <g> B) (C <g> D) = (AC) ® (BD) , (9) 
vec (A B) = diag(vec (A)) vec (B) . (10) 
These identities hold whenever the sizes of the involved matrices are compatible. 
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